Data Description:

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled). It has 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).

Domain:

Cement manufacturing

Context:

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Attribute Information:

  • Cement : measured in kg per m³ of mixture
  • Blast furnace slag : measured in kg per m³ of mixture
  • Fly ash : measured in kg per m³ of mixture
  • Water : measured in kg per m³ of mixture
  • Superplasticizer : measured in kg per m³ of mixture
  • Coarse Aggregate : measured in kg per m³ of mixture
  • Fine Aggregate : measured in kg per m³ of mixture
  • Age : days (1-365)
  • Concrete compressive strength : measured in MPa (target)

Learning Outcomes:

  • Exploratory Data Analysis
  • Building ML models for regression
  • Hyperparameter tuning

Import Libraries

In [423]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#%config InlineBackend.figure_format = 'retina'

from scipy.stats import zscore
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

# Import KNN Regressor machine learning library
from sklearn.neighbors import KNeighborsRegressor
# Import Decision Tree Regressor machine learning library
from sklearn.tree import DecisionTreeRegressor
# Import ensemble machine learning library
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,AdaBoostRegressor,BaggingRegressor)
# Import support vector regressor machine learning library
from sklearn.svm import SVR
#Import the metrics
from sklearn import metrics
#Import the Voting regressor for Ensemble
from sklearn.ensemble import VotingRegressor
# Import stats from scipy
from scipy import stats
#importing the metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
#importing the K fold
from sklearn.model_selection import KFold
#importing the cross validation score
from sklearn.model_selection import cross_val_score
# importing the Polynomial features
from sklearn.preprocessing import PolynomialFeatures
#importing kmeans clustering library
from sklearn.cluster import KMeans
from sklearn.utils import resample

Import Data

In [424]:
# import data
df=pd.read_csv("concrete (1).csv")
df.shape
Out[424]:
(1030, 9)

Data Types and Data set values Analysis

In [425]:
# get info on the available data and compare it with the details provided about the data set

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1030 non-null   float64
 1   slag          1030 non-null   float64
 2   ash           1030 non-null   float64
 3   water         1030 non-null   float64
 4   superplastic  1030 non-null   float64
 5   coarseagg     1030 non-null   float64
 6   fineagg       1030 non-null   float64
 7   age           1030 non-null   int64  
 8   strength      1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
  • Total of 1030 rows and 9 columns present
  • All data are numerical, as expected from the description
  • Since all values are numeric, there is little chance of text or symbols in the data, apart from a few unexpected numbers that may turn out to be outliers
  • The non-null count is the same for all columns, suggesting no null values
  • We have 8 independent variables and 1 dependent variable
  • Target column: strength
In [426]:
#Analyze distribution
df.describe().T
Out[426]:
count mean std min 25% 50% 75% max
cement 1030.0 281.167864 104.506364 102.00 192.375 272.900 350.000 540.0
slag 1030.0 73.895825 86.279342 0.00 0.000 22.000 142.950 359.4
ash 1030.0 54.188350 63.997004 0.00 0.000 0.000 118.300 200.1
water 1030.0 181.567282 21.354219 121.80 164.900 185.000 192.000 247.0
superplastic 1030.0 6.204660 5.973841 0.00 0.000 6.400 10.200 32.2
coarseagg 1030.0 972.918932 77.753954 801.00 932.000 968.000 1029.400 1145.0
fineagg 1030.0 773.580485 80.175980 594.00 730.950 779.500 824.000 992.6
age 1030.0 45.662136 63.169912 1.00 7.000 28.000 56.000 365.0
strength 1030.0 35.817961 16.705742 2.33 23.710 34.445 46.135 82.6
  • All columns are quantitative
  • There is a large gap between the min and max values of several columns
  • Mean and median (50%) differ for many columns, indicating skewness in those attributes.
  • The mean is higher than the median for slag, ash and age, indicating right skewness (long tail towards the right)
  • Many min values are 0; their feasibility and practicality need validation
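The zero-minimum observation above can be quantified with a per-column zero count. A minimal sketch on a toy frame (standing in for `df`; a zero in slag or ash usually means the ingredient was simply not used in that mix, i.e. a valid value rather than missing data):

```python
import pandas as pd

# Toy frame mirroring a few df columns; in the notebook the same
# check would be run on df itself.
toy = pd.DataFrame({
    "slag":         [0.0, 22.0, 0.0],
    "ash":          [0.0, 0.0, 118.3],
    "superplastic": [0.0, 6.4, 10.2],
})

# Count zeros per column
zero_counts = (toy == 0).sum()
print(zero_counts)
```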
In [427]:
#view the top 5 rows of data
df.head()
Out[427]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29
In [428]:
#view bottom 5 rows of data
df.tail()
Out[428]:
cement slag ash water superplastic coarseagg fineagg age strength
1025 135.0 0.0 166.0 180.0 10.0 961.0 805.0 28 13.29
1026 531.3 0.0 0.0 141.8 28.2 852.1 893.7 3 41.30
1027 276.4 116.0 90.3 179.6 8.9 870.1 768.3 28 44.28
1028 342.0 38.0 0.0 228.0 0.0 932.0 670.0 270 55.06
1029 540.0 0.0 0.0 173.0 0.0 1125.0 613.0 7 52.61
In [429]:
# view random 10 data from dataset
df.sample(10)
Out[429]:
cement slag ash water superplastic coarseagg fineagg age strength
490 491.0 26.0 123.0 210.0 3.9 882.0 699.0 7 33.49
264 366.0 187.0 0.0 191.3 6.6 824.3 756.9 28 65.91
262 266.0 114.0 0.0 228.0 0.0 932.0 670.0 270 51.73
435 525.0 0.0 0.0 189.0 0.0 1125.0 613.0 14 48.40
434 331.0 0.0 0.0 192.0 0.0 1025.0 821.0 120 39.38
1015 298.0 0.0 107.0 186.0 6.0 879.0 815.0 28 42.64
702 139.9 132.6 103.3 200.3 7.4 916.0 753.4 28 36.44
541 145.0 116.0 119.0 184.0 5.7 833.0 880.0 28 29.16
213 212.0 0.0 124.8 159.0 7.8 1085.4 799.5 3 19.52
887 213.5 0.0 174.2 154.6 11.7 1052.3 775.5 56 51.43
  • With the head, tail and random samples we can see a diverse spread of data across all attributes.

Analyzing Attributes

In [430]:
fig, axs = plt.subplots(nrows = 9, ncols=2, figsize = (10,30))


for i, x in enumerate(df.columns):
    sns.distplot(df[x],ax=axs[i,0],bins=50, rug=True)
    sns.boxplot(orient='v', data=df[x], ax=axs[i, 1])

fig.tight_layout()
plt.show()
plt.clf()
plt.close()
#for colname in df.columns:
#    sns.distplot(df[colname])
  • Cement: close to a normal distribution (slight tail towards higher values) with no visible outliers; the mode appears to be around 200-250
  • Slag: the mode is clearly 0; the rest of the distribution is right skewed, with a few values beyond 300. A few outliers are also visible beyond 350.
  • Ash: similar to slag, the mode is 0, but unlike slag there is a large gap between 0 and roughly 50-100. The value 0 may need further investigation to decide whether it is a missing value or something else. No visible outliers.
  • Water: the distribution has multiple peaks but looks spread across both sides of the main peak. A few outliers on both the lower and higher side.
  • Superplastic: the mode is zero and the distribution is right skewed; a few outliers on the higher side
  • Coarseagg: data points look well distributed with no visible outliers
  • Fineagg: the distribution looks normal but has a few outliers at higher values
  • Age: highly right-skewed data with many outliers
  • Strength: data points look normally distributed, with a few outliers towards higher values
In [ ]:
 
In [431]:
df.boxplot(figsize=(15, 10));
  • From box plot we can clearly see there is high presence of outliers on slag, water, superplastic, fineagg, age and strength
In [432]:
sns.pairplot(df, diag_kind='kde');
  • From an initial view it is difficult to see any strong linear or polynomial correlation among the attributes. A few weak correlations are present between some attributes; the correlation heatmap should provide more insight.
  • The majority of bivariate plots show a cloud-like structure
  • From the diagonal KDE plots, at least two distinct peaks are visible in several of the independent attributes
In [433]:
df.hist(bins=30,figsize=(25, 10));
In [434]:
plt.figure(figsize=(15, 10))
sns.heatmap(df.corr(),annot=True,cmap="cividis",linecolor="black" );
In [435]:
df.corr()['strength'].sort_values()
Out[435]:
water          -0.289633
fineagg        -0.167241
coarseagg      -0.164935
ash            -0.105755
slag            0.134829
age             0.328873
superplastic    0.366079
cement          0.497832
strength        1.000000
Name: strength, dtype: float64
  • From the heatmap we can clearly observe the linear relations.
  • The majority of attributes have weak correlations, roughly in the range -0.3 to +0.3.
  • A few stronger negative correlations are visible between attributes, e.g. superplastic & water, superplastic & ash, fineagg & water
  • From the correlation details with strength, the strongest positive/negative correlations are with cement, superplastic, age and water.
  • We can plan to remove weakly correlated attributes later, based on further analysis.
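The planned removal of weakly correlated attributes can be sketched programmatically. A minimal example on a toy frame; the 0.3 threshold is an illustrative assumption, not a value derived from the concrete data set:

```python
import pandas as pd

# Toy frame: cement tracks strength closely, fineagg barely varies with it.
toy = pd.DataFrame({
    "cement":   [100, 200, 300, 400, 500],
    "fineagg":  [800, 790, 800, 790, 800],
    "strength": [20, 30, 42, 55, 68],
})

# Flag features whose absolute correlation with the target is below 0.3
corr = toy.corr()["strength"].drop("strength")
weak = corr[corr.abs() < 0.3].index.tolist()
print(weak)
```

In the notebook the same pattern would run on `df.corr()['strength']` to shortlist candidates such as fineagg, coarseagg and ash for elimination.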
In [436]:
sns.lmplot(x="cement",y="strength",data=df)
plt.show()
  • we can notice a clear positive correlation
In [437]:
#cement vs water
sns.lmplot(x="cement",y="water",data=df)
plt.show()
  • We can clearly see the line is almost parallel to the x axis, representing very weak correlation between the attributes.
In [438]:
sns.lmplot(x="superplastic",y="water",data=df)
plt.show()
  • we can notice the strong negative correlation line
In [ ]:
 
In [439]:
sns.lmplot(x="ash",y="strength",data=df)
plt.show()
  • We can notice the line is almost parallel to the x axis.

Missing Values

In [440]:
df.isnull().sum()
Out[440]:
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64
  • No null values present

Analyzing Outliers and treatment

In [ ]:
 
In [441]:
df.boxplot(figsize=(10, 10));
  • From box plot we can clearly see there is high presence of outliers on slag, water, superplastic, fineagg, age and strength
In [442]:
# Number of outliers
In [443]:
for cols in df.columns[:-1]:
    q1=df[cols].quantile(0.25)
    q3=df[cols].quantile(0.75)
    iqr=q3-q1
    
    low=q1-1.5*iqr
    high=q3+1.5*iqr
    print('Outliers count for',cols, df.loc[((df[cols]>high) | (df[cols]<low)),cols].count())
    print('high value Outliers count for',cols, df.loc[(df[cols]>high),cols].count())
    print('low value Outliers count for',cols, df.loc[(df[cols]<low),cols].count())
Outliers count for cement 0
high value Outliers count for cement 0
low value Outliers count for cement 0
Outliers count for slag 2
high value Outliers count for slag 2
low value Outliers count for slag 0
Outliers count for ash 0
high value Outliers count for ash 0
low value Outliers count for ash 0
Outliers count for water 9
high value Outliers count for water 4
low value Outliers count for water 5
Outliers count for superplastic 10
high value Outliers count for superplastic 10
low value Outliers count for superplastic 0
Outliers count for coarseagg 0
high value Outliers count for coarseagg 0
low value Outliers count for coarseagg 0
Outliers count for fineagg 5
high value Outliers count for fineagg 5
low value Outliers count for fineagg 0
Outliers count for age 59
high value Outliers count for age 59
low value Outliers count for age 0
  • There is a significant number of outliers in the age attribute compared to the other attributes.
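As a cross-check on the IQR counts, the `zscore` import at the top of the notebook supports a |z| > 3 rule instead. A sketch on toy values standing in for the age column (on this tiny, moderately spread sample no point exceeds |z| > 3, which illustrates that the two rules can disagree):

```python
import numpy as np
from scipy.stats import zscore

# Toy stand-in for df['age']
age = np.array([7.0, 28.0, 28.0, 56.0, 90.0, 365.0])

# Standardize and count points more than 3 standard deviations from the mean
z = zscore(age)
n_outliers = int((np.abs(z) > 3).sum())
print(n_outliers)
```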
In [444]:
#Records containing outliers
for cols in df.columns[:-1]:
    q1=df[cols].quantile(0.25)
    q3=df[cols].quantile(0.75)
    iqr=q3-q1
    
    low=q1-1.5*iqr
    high=q3+1.5*iqr
    print()
    print('Outliers count for',cols)
    print(df.loc[((df[cols]>high) | (df[cols]<low)),:])
Outliers count for cement
Empty DataFrame
Columns: [cement, slag, ash, water, superplastic, coarseagg, fineagg, age, strength]
Index: []

Outliers count for slag
     cement   slag  ash  water  superplastic  coarseagg  fineagg  age  \
918   239.6  359.4  0.0  185.7           0.0      941.6    664.3   28   
990   239.6  359.4  0.0  185.7           0.0      941.6    664.3    7   

     strength  
918     39.44  
990     25.42  

Outliers count for ash
Empty DataFrame
Columns: [cement, slag, ash, water, superplastic, coarseagg, fineagg, age, strength]
Index: []

Outliers count for water
     cement   slag    ash  water  superplastic  coarseagg  fineagg  age  \
66    237.0   92.0   71.0  247.0           6.0      853.0    695.0   28   
263   236.9   91.7   71.5  246.9           6.0      852.9    695.4   28   
432   168.0   42.1  163.8  121.8           5.7     1058.7    780.1   28   
462   168.0   42.1  163.8  121.8           5.7     1058.7    780.1  100   
587   168.0   42.1  163.8  121.8           5.7     1058.7    780.1    3   
740   140.0  164.0  128.0  237.0           6.0      869.0    656.0   28   
789   168.0   42.1  163.8  121.8           5.7     1058.7    780.1   56   
826   139.7  163.9  127.7  236.7           5.8      868.6    655.6   28   
914   168.0   42.1  163.8  121.8           5.7     1058.7    780.1   14   

     strength  
66      28.63  
263     28.63  
432     24.24  
462     39.23  
587      7.75  
740     35.23  
789     32.85  
826     35.23  
914     17.82  

Outliers count for superplastic
      cement   slag  ash  water  superplastic  coarseagg  fineagg  age  \
44     531.3    0.0  0.0  141.8          28.2      852.1    893.7   91   
156    531.3    0.0  0.0  141.8          28.2      852.1    893.7   28   
232    469.0  117.2  0.0  137.8          32.2      852.1    840.5   56   
292    469.0  117.2  0.0  137.8          32.2      852.1    840.5   91   
538    531.3    0.0  0.0  141.8          28.2      852.1    893.7    7   
744    469.0  117.2  0.0  137.8          32.2      852.1    840.5    7   
816    469.0  117.2  0.0  137.8          32.2      852.1    840.5   28   
838    531.3    0.0  0.0  141.8          28.2      852.1    893.7   56   
955    469.0  117.2  0.0  137.8          32.2      852.1    840.5    3   
1026   531.3    0.0  0.0  141.8          28.2      852.1    893.7    3   

      strength  
44        59.2  
156       56.4  
232       69.3  
292       70.7  
538       46.9  
744       54.9  
816       66.9  
838       58.8  
955       40.2  
1026      41.3  

Outliers count for coarseagg
Empty DataFrame
Columns: [cement, slag, ash, water, superplastic, coarseagg, fineagg, age, strength]
Index: []

Outliers count for fineagg
     cement  slag  ash  water  superplastic  coarseagg  fineagg  age  strength
129   375.0  93.8  0.0  126.6          23.4      852.1    992.6   91      62.5
447   375.0  93.8  0.0  126.6          23.4      852.1    992.6    7      45.7
504   375.0  93.8  0.0  126.6          23.4      852.1    992.6    3      29.0
584   375.0  93.8  0.0  126.6          23.4      852.1    992.6   56      60.2
857   375.0  93.8  0.0  126.6          23.4      852.1    992.6   28      56.7

Outliers count for age
      cement   slag  ash  water  superplastic  coarseagg  fineagg  age  \
51     331.0    0.0  0.0  192.0           0.0      978.0    825.0  180   
64     332.5  142.5  0.0  228.0           0.0      932.0    594.0  365   
93     427.5   47.5  0.0  228.0           0.0      932.0    594.0  180   
99     237.5  237.5  0.0  228.0           0.0      932.0    594.0  180   
103    380.0    0.0  0.0  228.0           0.0      932.0    670.0  180   
133    236.0    0.0  0.0  193.0           0.0      968.0    885.0  365   
144    302.0    0.0  0.0  203.0           0.0      974.0    817.0  180   
149    380.0   95.0  0.0  228.0           0.0      932.0    594.0  270   
152    322.0    0.0  0.0  203.0           0.0      974.0    800.0  180   
157    198.6  132.4  0.0  192.0           0.0      978.4    825.5  360   
159    304.0   76.0  0.0  228.0           0.0      932.0    670.0  365   
198    266.0  114.0  0.0  228.0           0.0      932.0    670.0  365   
199    277.0    0.0  0.0  191.0           0.0      968.0    856.0  180   
207    190.0  190.0  0.0  228.0           0.0      932.0    670.0  180   
256    525.0    0.0  0.0  189.0           0.0     1125.0    613.0  270   
262    266.0  114.0  0.0  228.0           0.0      932.0    670.0  270   
270    500.0    0.0  0.0  200.0           0.0     1125.0    613.0  270   
297    475.0    0.0  0.0  228.0           0.0      932.0    594.0  270   
302    342.0   38.0  0.0  228.0           0.0      932.0    670.0  180   
312    236.0    0.0  0.0  193.0           0.0      968.0    885.0  180   
313    540.0    0.0  0.0  173.0           0.0     1125.0    613.0  270   
323    139.6  209.4  0.0  192.0           0.0     1047.0    806.9  360   
359    475.0    0.0  0.0  228.0           0.0      932.0    594.0  180   
361    277.0    0.0  0.0  191.0           0.0      968.0    856.0  360   
370    266.0  114.0  0.0  228.0           0.0      932.0    670.0  180   
393    342.0   38.0  0.0  228.0           0.0      932.0    670.0  365   
448    331.0    0.0  0.0  192.0           0.0      978.0    825.0  360   
465    427.5   47.5  0.0  228.0           0.0      932.0    594.0  365   
484    237.5  237.5  0.0  228.0           0.0      932.0    594.0  365   
539    304.0   76.0  0.0  228.0           0.0      932.0    670.0  180   
570    190.0  190.0  0.0  228.0           0.0      932.0    670.0  270   
581    525.0    0.0  0.0  189.0           0.0     1125.0    613.0  180   
594    339.0    0.0  0.0  197.0           0.0      968.0    781.0  180   
601    339.0    0.0  0.0  197.0           0.0      968.0    781.0  365   
620    332.5  142.5  0.0  228.0           0.0      932.0    594.0  180   
622    380.0   95.0  0.0  228.0           0.0      932.0    594.0  180   
623    380.0    0.0  0.0  228.0           0.0      932.0    670.0  270   
632    304.0   76.0  0.0  228.0           0.0      932.0    670.0  270   
642    198.6  132.4  0.0  192.0           0.0      978.4    825.5  180   
696    307.0    0.0  0.0  193.0           0.0      968.0    812.0  180   
713    190.0  190.0  0.0  228.0           0.0      932.0    670.0  365   
720    380.0   95.0  0.0  228.0           0.0      932.0    594.0  365   
721    500.0    0.0  0.0  200.0           0.0     1125.0    613.0  180   
754    254.0    0.0  0.0  198.0           0.0      968.0    863.0  365   
755    349.0    0.0  0.0  192.0           0.0     1047.0    806.0  360   
776    540.0    0.0  0.0  173.0           0.0     1125.0    613.0  180   
850    427.5   47.5  0.0  228.0           0.0      932.0    594.0  270   
861    310.0    0.0  0.0  192.0           0.0      970.0    850.0  180   
878    237.5  237.5  0.0  228.0           0.0      932.0    594.0  270   
900    254.0    0.0  0.0  198.0           0.0      968.0    863.0  180   
901    475.0    0.0  0.0  228.0           0.0      932.0    594.0  365   
919    310.0    0.0  0.0  192.0           0.0      970.0    850.0  360   
951    332.5  142.5  0.0  228.0           0.0      932.0    594.0  270   
957    307.0    0.0  0.0  193.0           0.0      968.0    812.0  365   
971    349.0    0.0  0.0  192.0           0.0     1047.0    806.0  180   
985    350.0    0.0  0.0  203.0           0.0      974.0    775.0  180   
995    380.0    0.0  0.0  228.0           0.0      932.0    670.0  365   
1017   139.6  209.4  0.0  192.0           0.0     1047.0    806.9  180   
1028   342.0   38.0  0.0  228.0           0.0      932.0    670.0  270   

      strength  
51       39.00  
64       41.05  
93       41.84  
99       36.25  
103      53.10  
133      25.08  
144      26.74  
149      41.15  
152      29.59  
157      44.30  
159      55.26  
198      52.91  
199      32.33  
207      46.93  
256      67.11  
262      51.73  
270      55.16  
297      42.13  
302      52.12  
312      24.10  
313      74.17  
323      44.70  
359      42.62  
361      33.70  
370      48.70  
393      56.14  
448      41.24  
465      43.70  
484      39.00  
539      50.95  
570      50.66  
581      61.92  
594      36.45  
601      38.89  
620      39.78  
622      40.76  
623      53.30  
632      54.38  
642      41.72  
696      34.49  
713      53.69  
720      43.70  
721      51.04  
754      29.79  
755      42.13  
776      71.62  
850      43.01  
861      37.33  
878      38.41  
900      27.63  
901      41.93  
919      38.11  
951      40.27  
957      36.15  
971      41.05  
985      32.72  
995      52.52  
1017     44.21  
1028     55.06  
  • We will treat the columns with the same IQR logic used in the box plots
  • We replace the outliers with the boundary values (capping), keeping high values close to the upper boundary and low values close to the lower boundary.
  • The goal is to keep values inside the acceptable range without losing their main characteristic of being an outlier, i.e. a high/low number. This approach slightly increases the frequency at the boundary values, but the values stay in range. We have to keep an eye out for a larger group forming at a boundary, or a new peak in the graphs.
  • Replacing with the mean or median might misrepresent the main characteristic of the data that pushed the value to be an outlier (introducing bias by forcing a normal curve)
  • We will do this only once; new outliers may appear, as capping changes the IQR
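The capping described above can be wrapped in a small helper. A sketch using pandas' `clip`; the name `cap_iqr` is illustrative, not from the notebook:

```python
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip a series to [Q1 - k*IQR, Q3 + k*IQR] (IQR capping)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy check: 500 is far above the rest and gets pulled back to the upper fence
s = pd.Series([10, 12, 11, 13, 12, 500])
capped = cap_iqr(s)
print(capped.max())
```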
In [445]:
# select all columns except the last, which is our target
for cols in df.columns[:-1]:
    q1=df[cols].quantile(0.25)
    q3=df[cols].quantile(0.75)
    iqr=q3-q1
    
    low=q1-1.5*iqr
    high=q3+1.5*iqr
    
    df.loc[(df[cols]<low),cols]=low
    df.loc[(df[cols]>high),cols]=high
In [ ]:
 
In [446]:
# Rechecking the outliers and quartiles
df.boxplot(figsize=(10, 8));
  • No more outliers present in the independent columns
In [447]:
# check whether the outlier treatment introduced any additional peaks or bulges
sns.pairplot(df, diag_kind='kde');
  • Peaks are almost the same as those we noticed earlier, which suggests the outlier treatment has not materially affected the distributions near the boundaries.
In [ ]:
 
In [ ]:
 

Feature Engineering, Model Building and Model Tuning

  • All the features showed low correlation with each other,
  • a few features also have very low correlation with strength, which we can identify and eliminate
  • we will use Lasso to identify which features have a zero contribution, and
  • compare with what we observed during the correlation heatmap analysis
In [449]:
#Scaling data set


df_scaled = preprocessing.scale(df)
df_scaled=pd.DataFrame(df_scaled,columns=df.columns)
In [450]:
df_scaled.describe().T
Out[450]:
count mean std min 25% 50% 75% max
cement 1030.0 -3.858833e-16 1.000486 -1.715253 -0.850053 -0.079152 0.658961 2.477915
slag 1030.0 3.915961e-16 1.000486 -0.856971 -0.856971 -0.601823 0.800911 3.287734
ash 1030.0 3.634633e-16 1.000486 -0.847144 -0.847144 -0.847144 1.002278 2.281084
water 1030.0 4.909773e-16 1.000486 -2.700633 -0.784513 0.162941 0.492900 2.409020
superplastic 1030.0 -1.237414e-16 1.000486 -1.061968 -1.061968 0.041549 0.696762 3.334858
coarseagg 1030.0 7.116206e-16 1.000486 -2.212138 -0.526517 -0.063294 0.726761 2.214224
fineagg 1030.0 1.200334e-15 1.000486 -2.249277 -0.532607 0.075967 0.633775 2.383350
age 1030.0 -2.931851e-16 1.000486 -1.036502 -0.868740 -0.281572 0.501319 2.556406
strength 1030.0 -3.729487e-17 1.000486 -2.005552 -0.725131 -0.082225 0.617874 2.801717
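Note that `preprocessing.scale` above is fitted on the whole frame before the split. A common alternative, sketched here on toy data (standing in for the features), is to fit `StandardScaler` on the training split only so the test set does not leak into the scaling statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Toy feature matrix standing in for the 8 input columns
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_te = train_test_split(X, test_size=0.3, random_state=50)

# Fit statistics on train only, then apply to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
print(X_tr_s.mean(axis=0).round(6))  # ~0 for the train split by construction
```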
In [451]:
# splitting dependent and independent variables

X=df_scaled.iloc[:,0:8]
y = df_scaled.iloc[:,8]
In [452]:
# Training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 50)
In [ ]:
 
In [ ]:
 
In [453]:
# OLS - LinearRegression
# We noticed earlier that none of the features were very strong predictors of strength (max corr was ~0.50 with cement)
# so we expect our linear model to be a relatively weak predictor
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[idx]))
    
intercept = regression_model.intercept_

print("\nThe intercept for our model is {}".format(intercept))
The coefficient for cement is 0.687079071622048
The coefficient for slag is 0.4688245716075308
The coefficient for ash is 0.273743429541394
The coefficient for water is -0.19634629219088048
The coefficient for superplastic is 0.11990314082446604
The coefficient for coarseagg is 0.050509912617111075
The coefficient for fineagg is 0.05195168066138428
The coefficient for age is 0.5342967111489773

The intercept for our model is -0.015445641992344366
In [454]:
# predict strength for the test set
y_pred = regression_model.predict(X_test)

# Since this is regression, plot the predicted y value vs actual y values for the test data
# A good model's prediction will be close to actual leading to high R and R2 values
plt.scatter(y_test, y_pred);
  • We can notice the actual and predicted values are spread widely and not concentrated towards the diagonal
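Beyond the scatter of predicted vs actual values, the metrics imported at the top of the notebook can summarize the fit numerically; RMSE and MAE report error in the target's own units. A sketch with toy vectors standing in for `y_test` / `y_pred`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy actual vs predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat  = np.array([2.5, 5.0, 7.5, 9.0])

rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # penalizes large errors more
mae  = mean_absolute_error(y_true, y_hat)          # average absolute error
r2   = r2_score(y_true, y_hat)                     # variance explained
print(rmse, mae, r2)
```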
In [ ]:
 
In [455]:
#Ridge
ridge = Ridge(alpha=.3)
ridge.fit(X_train,y_train)
#print ("Ridge model:", (ridge.coef_))

for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, ridge.coef_[idx]))
The coefficient for cement is 0.6833000397825236
The coefficient for slag is 0.4651037309663957
The coefficient for ash is 0.27045160206958585
The coefficient for water is -0.19887559390630924
The coefficient for superplastic is 0.11995178760370528
The coefficient for coarseagg is 0.04792754536246317
The coefficient for fineagg is 0.04863082614623716
The coefficient for age is 0.533940915182883
In [456]:
#Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train,y_train)
#print ("Lasso model:", (lasso.coef_))
featureAnalysisLasso=lasso.coef_
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, featureAnalysisLasso[idx]))
    
# we will use the lasso coeff details later as well for feature selection
The coefficient for cement is 0.3831422264911074
The coefficient for slag is 0.14327032769990528
The coefficient for ash is 0.0
The coefficient for water is -0.10314623960719586
The coefficient for superplastic is 0.20588302963470465
The coefficient for coarseagg is -0.0
The coefficient for fineagg is -0.0
The coefficient for age is 0.38821596614532583
In [ ]:
 
In [457]:
#Comparing all linear regression details
In [458]:
print(regression_model.score(X_train, y_train))
print(regression_model.score(X_test, y_test))
0.7155697850496803
0.7522624525775197
In [459]:
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
0.7155675478286141
0.7521580675301119
In [460]:
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
0.6339560018439552
0.6297132969043879
In [ ]:
 
  • As expected from the linear correlation details, OLS (linear regression) is not a strong predictor, achieving an R² of 71.5% on the train set and 75.2% on the test set.
  • With Ridge, the results are very close to OLS
  • With Lasso, we notice a significant drop in accuracy, as it dropped 3 features from the model (fineagg, coarseagg, ash)
  • We will try a polynomial approach to see whether it brings a significant change.
  • The graph analysis did not reveal any variable with a clearly polynomial relation either, but we will give it a try.
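Since a single 70:30 split can be optimistic or pessimistic by chance, the `KFold` and `cross_val_score` imports at the top of the notebook give a more stable R² estimate. A sketch on toy data standing in for the scaled X and y:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy, mostly linear data with a little noise
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validated R^2 for a linear model
kf = KFold(n_splits=5, shuffle=True, random_state=50)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print(scores.mean().round(3))
```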
In [ ]:
 
In [ ]:
 
In [461]:
#Polynomial approach and complexity
In [462]:
from sklearn.preprocessing import PolynomialFeatures

#going with degree 2 only
poly = PolynomialFeatures(degree = 2, interaction_only=True)

#poly = PolynomialFeatures(2)
#X is already scaled and will use it
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.30, random_state=50)
X_train.shape
Out[462]:
(721, 37)
  • Shape of the train data set changed to (721, 37)
  • We now have 37 columns instead of 8 (close to a five-fold increase in variables)
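The 37 columns can be verified arithmetically: with `interaction_only=True` and degree 2, 8 features expand to 1 bias term + 8 originals + C(8, 2) pairwise products. A quick sanity check:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Expected column count: bias + originals + pairwise interaction terms
expected = 1 + 8 + comb(8, 2)

# Confirm against an actual transform of an 8-feature row
X_demo = np.zeros((1, 8))
n_out = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X_demo).shape[1]
print(expected, n_out)
```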
In [463]:
regression_model.fit(X_train, y_train)
print(regression_model.coef_)
[-3.45676167e-17  7.63842259e-01  5.59426100e-01  3.14916745e-01
 -1.40138424e-01  1.80612170e-01  4.79398919e-02  8.23101832e-02
  6.10493094e-01  5.07800388e-02  1.40290754e-01 -2.07662063e-01
 -1.51450828e-01  1.76774733e-02  6.82284381e-02  1.67770699e-01
  1.52505090e-01 -1.11766700e-01 -5.52470204e-02 -3.52530769e-02
  1.21687939e-01  2.55486067e-01 -9.65153210e-02 -1.17058481e-01
  6.33339299e-02  1.59881917e-01  2.24809010e-01  5.02942855e-02
 -7.53965034e-02 -3.43186790e-02  6.29538769e-03  6.02171450e-02
 -2.11009823e-02 -7.07269397e-03  3.79105697e-02  4.00135323e-02
  1.21108664e-01]
In [464]:
print(regression_model.score(X_train, y_train))
print(regression_model.score(X_test, y_test))
0.79219586327856
0.7990615016046269
In [465]:
ridge = Ridge(alpha=.3)
ridge.fit(X_train,y_train)
print ("Ridge model:", (ridge.coef_))
Ridge model: [ 0.          0.75585672  0.550645    0.30804927 -0.14591619  0.18045799
  0.04297793  0.07511162  0.61015136  0.04973342  0.13953737 -0.20436572
 -0.1471665   0.01820254  0.06813435  0.16341677  0.15016155 -0.10906613
 -0.05159931 -0.03576576  0.12139974  0.25154704 -0.09358242 -0.11475305
  0.06257324  0.15940097  0.2204412   0.05213395 -0.07349627 -0.03359663
  0.00382806  0.06272331 -0.01840402 -0.00628587  0.03807741  0.03869497
  0.11763446]
In [466]:
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
0.7921875539041159
0.7990933613207326
In [467]:
lasso = Lasso(alpha=0.01)
lasso.fit(X_train,y_train)
print ("Lasso model:", (lasso.coef_))
Lasso model: [ 0.00000000e+00  6.19167720e-01  3.84018819e-01  1.46842812e-01
 -1.96855565e-01  1.96327563e-01 -0.00000000e+00 -0.00000000e+00
  5.88565563e-01 -0.00000000e+00  3.20126744e-02 -6.61518442e-02
 -5.50190317e-02  3.53252915e-04  3.40603118e-02  0.00000000e+00
  2.68193431e-02  0.00000000e+00  4.38918622e-03 -4.93761830e-02
  8.74383138e-02  1.05439054e-01 -0.00000000e+00 -6.61024913e-02
  0.00000000e+00  7.91283438e-02  6.02738755e-02  4.16348686e-02
 -2.29719640e-02 -0.00000000e+00 -4.88233267e-02  9.76708305e-02
  0.00000000e+00  3.76581432e-02  3.00596195e-02 -0.00000000e+00
 -0.00000000e+00]
In [468]:
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
0.7785174642513701
0.7868074088721237
  • Comparing the polynomial-feature results for linear, Ridge and Lasso, there is a significant increase in accuracy on both the train and test sets.
  • For Lasso we notice a significant jump in accuracy from 63% to 78% (while eliminating 12 of the 37 features).
  • Polynomial features uncovered relations that were not visible in the data analysis phase, but we also observed a multi-fold increase in features, which increases model complexity. We might see further improvement by increasing the degree of the polynomial features, but the complexity would then be too high. We end the polynomial analysis here and will try a few other models to see whether we can get better numbers.
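The Ridge/Lasso alphas used above (0.3, 0.1, 0.01) were picked by hand; one of the stated learning outcomes is hyperparameter tuning, for which a small grid search is the usual tool. A sketch on toy data standing in for `X_poly` and `y`:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy regression data with a clear linear signal
rng = np.random.RandomState(0)
X = rng.normal(size=(150, 10))
y = X[:, 0] * 2 + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.2, size=150)

# Cross-validated search over candidate alphas
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 0.3, 1.0, 3.0]},
                    cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_["alpha"], round(grid.best_score_, 3))
```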
In [ ]:
 
Explore for gaussians. If data is likely to be a mix of gaussians, explore individual clusters and present your findings in terms of the independent attributes and their suitability to predict strength
In [469]:
sns.pairplot(df_scaled,diag_kind='kde')
Out[469]:
<seaborn.axisgrid.PairGrid at 0x1d50301cb48>
  • We can clearly see at least two Gaussians, or distinct peaks, in several of the distributions (slag, ash, superplastic, age). Age shows up to 3 peaks, though the third is comparatively smaller than the others.
  • The only field with a clear linear relationship to strength is cement; for every other feature the distribution of strength (y axis) is spread widely at any given feature value (x axis).
  • Even in the point clouds in the graphs we can observe two clearly distinct groups forming: one close to zero and the rest.
  • Based on the pairplot above and the correlation details with the strength (dependent) column below:
  • Cement: clear and maximum positive correlation with strength; the data distribution looks close to normal. This feature should be included in the model preparation.
  • Age: clear formation of strength groups at different age levels, with a positive correlation; multiple Gaussians appear to be present. This feature should be included in the model preparation.
  • superplastic: positive correlation, though comparatively weaker than cement or age in the graphs; the plots suggest two Gaussians (groups/peaks). This feature should be included in the model preparation.
  • slag: weak relation with strength, with a clear indication of two Gaussians. Against strength we see two groups forming, one below 0 and one above it. It has some significance, but this feature is a candidate for elimination; we will finalize based on further analysis.
  • ash: clearly weak correlation with strength, with two Gaussians clearly visible. On the scatter plot against strength we see two distinct clouds, but neither has a clear head or tail. This feature will not be beneficial in the model preparation.
  • coarseagg, fineagg: both features have similar distributions and correlation values. They correlate negatively with strength and are slightly stronger than ash, but visually it is very hard to identify any head or tail. These features will not add much to the model preparation, but we keep them for further analysis since their correlations are larger in magnitude than those of ash and slag.
  • water: the strongest negative correlation among the negatively correlated features, with 3 Gaussians in the distribution. On the scatter plot against strength we can see three groups forming: below -2, above +2, and between -2 and +2. This feature should be included in the model preparation.
In [470]:
df_scaled.corr().strength.sort_values()
Out[470]:
water          -0.291203
fineagg        -0.169584
coarseagg      -0.164935
ash            -0.105755
slag            0.134859
superplastic    0.366375
age             0.469625
cement          0.497832
strength        1.000000
Name: strength, dtype: float64
In [ ]:
 
In [471]:
# Trying to see how many groups we can capture with KMeans & GaussianMixture models
from sklearn.cluster import KMeans

# feature matrix without the target; defined here so this cell runs standalone
temp = df_scaled.drop(columns=['strength'])

cluster_range=range(1,15)
cluster_error=[]

for num_clusters in cluster_range:
    cluster=KMeans(num_clusters,n_init=20)
    cluster.fit(temp)
    labels=cluster.labels_
    centroid=cluster.cluster_centers_
    cluster_error.append(cluster.inertia_)

clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_error})
clusters_df[0:15]

from matplotlib import cm
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker = "X")
Out[471]:
[<matplotlib.lines.Line2D at 0x1d5704699c8>]
  • The KMeans inertia declines as a smooth curve, which suggests it is not able to identify clearly separated groups in the data
  • Still, from the KMeans clustering we can see an elbow forming at 2
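As a numeric cross-check on the visual elbow, the silhouette score peaks at the best-separated k instead of relying on a bend in the inertia curve. A minimal sketch on synthetic two-blob data standing in for `temp`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic two-blob data standing in for the scaled feature matrix `temp`
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.5, random_state=50)

# silhouette score complements the elbow plot: higher means tighter,
# better-separated clusters, so the best k is simply the argmax
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=20, random_state=50).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```

On the real `temp` frame the same loop would quantify how weak the k=2 elbow actually is.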
In [ ]:
 
In [ ]:
 
In [472]:
from copy import deepcopy
temp=deepcopy(df_scaled.drop(columns=['strength'], axis=1))
temp2=deepcopy(df_scaled)
In [473]:
#Splitting the target into 2 bins
#pandas cuts the continuous strength values into 2 ranges, converting them to categorical labels for this purpose
val2=pd.cut(temp2.strength, bins=2, labels=np.arange(2), right=False)
In [474]:
# training gaussian mixture model 
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2, random_state=50)
gmm.fit(temp)
Out[474]:
GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
                means_init=None, n_components=2, n_init=1, precisions_init=None,
                random_state=50, reg_covar=1e-06, tol=0.001, verbose=0,
                verbose_interval=10, warm_start=False, weights_init=None)
In [475]:
#predictions from gmm
labels = gmm.predict(temp)
frame = pd.DataFrame(temp)
frame['cluster'] = labels
frame['stregth_cluster']=val2
frame.columns = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'cluster','stregth_cluster']

color=['blue','green','cyan', 'black','red']
for k in range(0,2):
    data = frame[frame["cluster"]==k]
    plt.scatter(data["cement"],data["water"],c=color[k])
plt.show()
#plotting the clusters to check whether the clusters created by the model match the clusters present in the data
#plot with split as per model prediction
In [476]:
for k in range(0,2):
    data = frame[frame["stregth_cluster"]==k]
    plt.scatter(data["cement"],data["water"],c=color[k])
plt.show()
#Plot with the real split by strength
  • The model identifies the points towards the far right quite similarly, but within the main cloud we see a mix of all the data.
In [477]:
frame.head(10)
Out[477]:
cement slag ash water superplastic coarseagg fineagg age cluster stregth_cluster
0 -1.339017 1.601727 -0.847144 1.034976 -1.061968 -0.014398 -0.312618 -0.281572 1 0
1 -1.074790 -0.367551 1.096078 -1.095618 0.800217 1.388141 0.285302 -0.673017 0 0
2 -0.298384 -0.856971 0.648965 0.276070 -0.113633 -0.206121 1.100078 -0.281572 0 0
3 -0.145209 0.465159 -0.847144 2.189833 -1.061968 -0.526517 -1.296616 -0.281572 1 1
4 -1.209776 1.270035 -0.847144 0.554178 0.507095 0.958372 -0.961932 -0.281572 1 0
5 -0.250517 -0.856971 -0.847144 0.492900 -1.061968 -1.069519 2.150512 1.451972 1 0
6 -1.094894 2.044757 -0.847144 1.034976 -1.061968 0.034498 -1.013325 -0.868740 1 0
7 -0.284981 -0.856971 1.002278 0.327920 0.041549 0.713893 -0.197296 0.501319 0 0
8 0.141995 -0.856971 -0.847144 0.492900 -1.061968 1.442184 -0.105790 -0.281572 1 0
9 -1.207861 1.276994 1.388421 0.587174 0.489853 -1.195619 -0.933101 -0.281572 0 0
In [478]:
from IPython.display import display
In [479]:
ct = pd.crosstab(frame['cluster'], frame['stregth_cluster'])
display(ct)
stregth_cluster 0 1
cluster
0 331 133
1 380 186
  • From the crosstab we can see a clear mismatch between the groups, with significant lapses in recall and precision.
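The raw crosstab is hard to read because the GaussianMixture's 0/1 labels are arbitrary and may be swapped relative to the strength bins. The adjusted Rand index (ARI) is invariant to such label permutations; a small sketch with toy labels standing in for `cluster` and `stregth_cluster`:

```python
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# toy stand-ins for the notebook's `cluster` and `stregth_cluster` labels
pred = [0, 0, 1, 1, 0, 1, 0, 1]
true = [0, 0, 1, 1, 1, 0, 0, 1]

# ARI is invariant to swapping the 0/1 cluster labels, unlike the raw crosstab;
# 1.0 means perfect agreement, ~0.0 means chance-level agreement
print("ARI:", round(adjusted_rand_score(true, pred), 3))
print(confusion_matrix(true, pred))
```

Applied to `frame['cluster']` and `frame['stregth_cluster']`, an ARI near zero would confirm the mismatch seen above.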
In [ ]:
 
In [ ]:
 

Feature Importance and feature selection

In [480]:
#Feature selection
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
In [481]:
# Build Lin Reg  to use in feature selection
linR = LinearRegression()

# Training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 50)

# Build step forward feature selection
# we have chosen k_features=5 based on our earlier exploration with Lasso and the feature analysis
sfs1 = sfs(linR, k_features=5, forward=True, scoring='r2', cv=5)

# Perform SFFS
sfs1 = sfs1.fit(X_train.values, y_train.values)
In [482]:
sfs1.get_metric_dict()
Out[482]:
{1: {'feature_idx': (0,),
  'cv_scores': array([0.15771123, 0.24519912, 0.23424296, 0.30561598, 0.2957915 ]),
  'avg_score': 0.24771215559675958,
  'feature_names': ('0',),
  'ci_bound': 0.06789543229025541,
  'std_dev': 0.05282495295280584,
  'std_err': 0.026412476476402918},
 2: {'feature_idx': (0, 7),
  'cv_scores': array([0.4277372 , 0.42436636, 0.36149161, 0.4880218 , 0.44021138]),
  'avg_score': 0.42836566909619495,
  'feature_names': ('0', '7'),
  'ci_bound': 0.05201127684151155,
  'std_dev': 0.04046654037671599,
  'std_err': 0.020233270188357994},
 3: {'feature_idx': (0, 4, 7),
  'cv_scores': array([0.60562274, 0.57727829, 0.56755432, 0.61332534, 0.55142961]),
  'avg_score': 0.5830420588969424,
  'feature_names': ('0', '4', '7'),
  'ci_bound': 0.029864153613342534,
  'std_dev': 0.02323532609463339,
  'std_err': 0.011617663047316693},
 4: {'feature_idx': (0, 1, 4, 7),
  'cv_scores': array([0.70679328, 0.61859499, 0.63448202, 0.66787333, 0.64208424]),
  'avg_score': 0.6539655718211146,
  'feature_names': ('0', '1', '4', '7'),
  'ci_bound': 0.03964052498000154,
  'std_dev': 0.030841675153361465,
  'std_err': 0.01542083757668073},
 5: {'feature_idx': (0, 1, 3, 4, 7),
  'cv_scores': array([0.73811084, 0.6195117 , 0.68379764, 0.68030777, 0.67353024]),
  'avg_score': 0.6790516388351515,
  'feature_names': ('0', '1', '3', '4', '7'),
  'ci_bound': 0.04839109919951491,
  'std_dev': 0.03764991918190965,
  'std_err': 0.018824959590954825}}
In [483]:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

fig = plot_sfs(sfs1.get_metric_dict())

plt.title('Sequential Forward Selection (w. R^2)')
plt.grid()
plt.show()
In [484]:
# Which features?
columnList = list(X_train.columns)
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)
[0, 1, 3, 4, 7]
In [485]:
subsetColumnList = [columnList[i] for i in feat_cols] 
print(subsetColumnList)
['cement', 'slag', 'water', 'superplastic', 'age']
  • Above is the list of features for model building, selected by importance.
In [486]:
#Lasso coefficients for each feature
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, featureAnalysisLasso[idx]))
The coefficient for cement is 0.3831422264911074
The coefficient for slag is 0.14327032769990528
The coefficient for ash is 0.0
The coefficient for water is -0.10314623960719586
The coefficient for superplastic is 0.20588302963470465
The coefficient for coarseagg is -0.0
The coefficient for fineagg is -0.0
The coefficient for age is 0.38821596614532583
  • From the Lasso feature selection, the SequentialFeatureSelector and the correlation heat map, we can clearly identify the 5 features to carry forward for model building and testing without losing much information.
  • Features identified for the model: 'cement', 'slag', 'water', 'superplastic', 'age'
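The manual "keep the nonzero Lasso coefficients" step can also be expressed with `SelectFromModel`. A sketch on synthetic data whose coefficient pattern mimics the selection above (the feature names are reused for illustration only):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# synthetic data mimicking the result above: 8 columns, 3 of which
# (the ash, coarseagg, fineagg positions) have zero true coefficients
rng = np.random.default_rng(50)
X = rng.normal(size=(400, 8))
true_coef = np.array([3.0, 2.0, 0.0, -1.5, 1.0, 0.0, 0.0, 2.5])
y = X @ true_coef + rng.normal(scale=0.5, size=400)

names = np.array(['cement', 'slag', 'ash', 'water', 'superplastic',
                  'coarseagg', 'fineagg', 'age'])

# SelectFromModel automates the "drop zero-coefficient features" step
sel = SelectFromModel(Lasso(alpha=0.15), threshold=1e-6).fit(X, y)
print(names[sel.get_support()])  # expected to keep cement, slag, water, superplastic, age
```

`sel.transform(X)` would then return the reduced matrix directly, keeping the selection reproducible inside a pipeline.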
In [ ]:
 
In [487]:
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 50)
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train , y_train)
Out[487]:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
In [488]:
#printing the feature importance
print('Feature importances: \n',pd.DataFrame(dt_model.feature_importances_,columns=['Imp'],index=X_train.columns))
Feature importances: 
                    Imp
cement        0.362708
slag          0.089359
ash           0.012770
water         0.107049
superplastic  0.042246
coarseagg     0.036788
fineagg       0.036724
age           0.312356
  • The decision tree regressor also ends up with the same features: cement, slag, water, superplastic, age
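Impurity-based importances from a single tree can be unstable across refits; permutation importance measured on held-out data is a common cross-check. A hedged sketch on a synthetic stand-in for the concrete frame:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic stand-in: 8 features, only 5 of them informative
X, y = make_regression(n_samples=400, n_features=8, n_informative=5,
                       noise=5.0, random_state=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=50)

model = RandomForestRegressor(n_estimators=100, random_state=50).fit(X_train, y_train)
# importance = drop in held-out score when a column is shuffled, which
# avoids the training-set bias of impurity importances
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=50)
print(np.round(result.importances_mean, 3))
```

Columns whose permuted score barely drops are safe candidates for removal, mirroring the `drop(['ash','coarseagg','fineagg'])` step below.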

Model Building

  • During the linear correlation analysis we noticed that very few attributes in the data set have high correlation with, or predictive power for, the target attribute.
  • During our linear and polynomial analysis we noticed the accuracies are not very high.
  • Polynomial features provided slightly better results than linear regression, but at the cost of a large number of attributes.
  • Support vector regressors and KNN are strong models for classification and have significant capabilities for regression problems as well, but they do not seem like suitable candidates for these regression predictions.
  • Considering the above, we will focus our model building efforts on ensemble methods
  • We proceed with only those attributes which contribute significantly to predicting the target
In [578]:
modelComp=pd.DataFrame()
In [570]:
X=df_scaled.iloc[:,0:8]
y = df_scaled.iloc[:,8]
seed=50
num_folds = 50
#Removing less contributing features
X=X.drop(['ash','coarseagg','fineagg'],axis=1)

# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = seed)
In [491]:
X.columns
Out[491]:
Index(['cement', 'slag', 'water', 'superplastic', 'age'], dtype='object')
In [492]:
#Decision tree regressor with default parameter values
from sklearn import metrics
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train , y_train)

y_pred = dt_model.predict(X_test)
# Train data accuracy
print('Performance on training data using DT:',dt_model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using DT:',dt_model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))
Performance on training data using DT: 0.9931804455645835
Performance on testing data using DT: 0.8578004399700222
Accuracy Test:  0.8578004399700222
MSE:  0.15486533683354078
In [493]:
from scipy.stats import pearsonr  
sns.jointplot(x=y_test, y=y_pred, stat_func=pearsonr,kind="reg");
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\axisgrid.py:1848: UserWarning: JointGrid annotation is deprecated and will be removed in a future release.
  warnings.warn(UserWarning(msg))
  • For the decision tree with default settings, training accuracy is close to 99% while test accuracy is around 86%. This gap suggests we have an overfitting model.
  • The pearsonr joint plot likewise shows values spread far from the expected line.
In [494]:
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
model1 = dt_model
results = cross_val_score(model1, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
[ 0.37889328  0.7643922   0.8716924   0.33767375  0.94249112  0.78272312
  0.87737798  0.94530523  0.7332956   0.8055669   0.81426186  0.72937867
  0.74586675  0.8363719   0.82006693  0.77945095  0.7552987   0.83393711
  0.91752089  0.74382966  0.95359906  0.91935805  0.9164501   0.87671255
  0.80486146  0.81037112  0.94375624  0.82161575  0.76257337  0.8444418
  0.92983672  0.8324531   0.9326758   0.87710803  0.61444506  0.71333564
  0.75245275  0.84159355  0.96007642  0.63785833  0.83382825  0.74091084
  0.81268228  0.82890815  0.71325078 -0.59545263  0.77588295  0.77433585
  0.636522    0.72466912]

 Average model Accuracy: 76.813% with std. dev - (22.998%)
  • This model with default settings is not suitable: it averages 77% accuracy but with a 23% standard deviation. From the fold details we can see it scores as low as 0.60 (and even negative on one fold) while reaching around 0.95 on others. This confirms the model is overfitted and will generalize poorly across data sets.
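Part of that 23% spread comes from the CV setup itself: with 50 folds on roughly 721 training rows, each validation fold holds only about 14 samples, so a single hard fold swings the statistics. A sketch of the same budget of fits with larger folds via `RepeatedKFold`, on synthetic stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in sized like the 721-row training split
X, y = make_regression(n_samples=721, n_features=5, noise=10.0, random_state=50)

# 10 folds repeated 5 times = 50 fits, but each validation fold holds
# ~72 samples instead of ~14, so no single fold dominates the std. dev.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=50)
scores = cross_val_score(DecisionTreeRegressor(random_state=50), X, y, cv=cv)
print(round(scores.mean(), 3), round(scores.std(), 3))
```

The mean stays comparable; the standard deviation becomes a more honest estimate of model variance rather than fold-size noise.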

Pruning the Decision Tree

In [579]:
#Regularizing the decision tree with different hyperparameter values
model = DecisionTreeRegressor( max_depth = 8,random_state=seed,min_samples_leaf=4)
model.fit(X_train , y_train)
Out[579]:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=8,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=4, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=50, splitter='best')
In [580]:
y_pred = model.predict(X_test)
# Train data accuracy
print('Performance on training data using DT:',model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using DT:',model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
#model1 = model
results = cross_val_score(model, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))


modelComp=modelComp.append(pd.DataFrame({'Model':['Decision Tree'],
                                         'Train Accuracy':[model.score(X_train,y_train)],
                                         'Test Accuracy':[model.score(X_test , y_test)],
                                         'Kfold-Mean-Accuracy':[results.mean()],
                                         'Kfold-StdDeviation':[results.std()]}))
Performance on training data using DT: 0.9157578226774162
Performance on testing data using DT: 0.8221936583519681
Accuracy Test:  0.8221936583519681
MSE:  0.1936436300130399
[ 0.58352791  0.77751064  0.86693782  0.66697265  0.9263555   0.74800387
  0.83358359  0.94576327  0.68346975  0.79488526  0.82637244  0.62455056
  0.70040223  0.8539491   0.68544945  0.91763889  0.68671081  0.76368468
  0.90720833  0.70643679  0.82857835  0.91097802  0.8699626   0.86715775
  0.68934431  0.75213356  0.87448738  0.87317311  0.94202855  0.66579481
  0.95999579  0.80179991  0.60090191  0.84439319  0.63390923  0.73750533
  0.65727273  0.90215302  0.81701375  0.38326336  0.81837513  0.83445852
  0.78628001  0.67565402  0.60729625 -0.30142936  0.751876    0.83082967
  0.56846557  0.83216219]

 Average model Accuracy: 75.030% with std. dev - (19.086%)
  • With the pruned decision tree the train and test accuracies are much closer, which reduces the overfitting
  • Considering the cross-val scores, however, the standard deviation is still very high, and some gap remains between train and test accuracies, so we still end up with a somewhat overfit model. Any further pruning of the tree reduces both the test and cross-val accuracies.
  • We will move on to some other models and try grid search on the better-performing ones in the next steps.
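Instead of hand-tuning `max_depth` and `min_samples_leaf`, scikit-learn's cost-complexity pruning enumerates the candidate pruning levels directly. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for the selected-feature training data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=50)

# cost_complexity_pruning_path lists the effective alphas at which subtrees
# collapse; cross-validating over (a subsample of) them picks the pruning level
path = DecisionTreeRegressor(random_state=50).cost_complexity_pruning_path(X, y)
best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas[::10]:
    score = cross_val_score(
        DecisionTreeRegressor(random_state=50, ccp_alpha=alpha), X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = float(alpha), score
print(round(best_alpha, 4), round(best_score, 3))
```

The winning `ccp_alpha` can then be passed to the final tree, replacing the manual depth/leaf experiments.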
Random Forest
In [497]:
#Random Forest with default config
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

model.fit(X_train, y_train)
Out[497]:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
In [498]:
y_pred = model.predict(X_test)
# Train data accuracy
print('Performance on training data using DT:',model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using DT:',model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
#model1 = model
results = cross_val_score(model, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Performance on training data using DT: 0.9812988609705172
Performance on testing data using DT: 0.8997334146445637
Accuracy Test:  0.8997334146445637
MSE:  0.10919737382411805
[ 0.85792824  0.93317926  0.87581829  0.6992177   0.94449582  0.787913
  0.90765613  0.9535197   0.82941044  0.95697488  0.91911818  0.89646221
  0.95648582  0.82867804  0.86641402  0.90234113  0.8249307   0.84232856
  0.96270042  0.79657538  0.94041663  0.96623176  0.91256236  0.86927299
  0.85102998  0.91242885  0.94476305  0.92849752  0.91885043  0.89906699
  0.9588847   0.93795866  0.85366833  0.92578297  0.78687871  0.92771848
  0.86681408  0.94248861  0.92008639  0.72333353  0.91196625  0.89287268
  0.94894533  0.95092437  0.85828994 -0.22331467  0.93460931  0.9242971
  0.7840094   0.8523275 ]

 Average model Accuracy: 86.728% with std. dev - (16.758%)
  • With default values alone we already get significantly better results than the decision tree
  • Training accuracy is close to 98%, which hints at overfitting, but test accuracy is around 90%, not very far behind
  • The model seems slightly overfit, but we can try grid search to see whether it improves the results and closes the gap between train and test accuracies.
  • We grid search on Random Forest since even its defaults beat the decision tree
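Before a full grid search, the out-of-bag (OOB) score offers a cheap built-in generalization estimate for a random forest, since each tree is validated on the bootstrap rows it never saw. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in for the selected-feature training data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=50)

# each tree is scored on the bootstrap rows it never saw (out-of-bag),
# giving a validation-like R^2 without a separate hold-out split
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=50)
rf.fit(X, y)
print("train R^2:", round(rf.score(X, y), 3), " OOB R^2:", round(rf.oob_score_, 3))
```

A large gap between the train and OOB scores flags overfitting before any grid-search time is spent.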
In [499]:
from sklearn.model_selection import GridSearchCV

parameters = {'bootstrap': [True],
 'max_depth': [5, 10, 15, 20, 25, 30, 35, 40, 50],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4, 8],
 'n_estimators': [100]}


clf = GridSearchCV(RandomForestRegressor(), 
                   parameters, 
                   cv = 5, 
                   verbose = 2, 
                   n_jobs= 4)
clf.fit(X_train, y_train)

clf.best_params_
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    4.4s
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:   13.9s
[Parallel(n_jobs=4)]: Done 360 out of 360 | elapsed:   31.3s finished
Out[499]:
{'bootstrap': True,
 'max_depth': 15,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'n_estimators': 100}
In [582]:
#The grid search gives better training accuracy but pushes the model into the overfit zone,
#so we pruned max_depth after a few iterations
model = RandomForestRegressor(bootstrap = True,
 max_depth= 9,
 max_features= 'sqrt',
 min_samples_leaf= 1,
 n_estimators= 100)

model.fit(X_train, y_train)



y_pred = model.predict(X_test)
# Train data accuracy
print('Performance on training data using DT:',model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using DT:',model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
#model1 = model
results = cross_val_score(model, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

modelComp=modelComp.append(pd.DataFrame({'Model':['Random Forest Regressor'],
                                         'Train Accuracy':[model.score(X_train,y_train)],
                                         'Test Accuracy':[model.score(X_test , y_test)],
                                         'Kfold-Mean-Accuracy':[results.mean()],
                                         'Kfold-StdDeviation':[results.std()]}))
Performance on training data using DT: 0.9623898114811585
Performance on testing data using DT: 0.8958057580219252
Accuracy Test:  0.8958057580219253
MSE:  0.11347486853439115
[ 0.89337115  0.92242096  0.86218141  0.69785831  0.9393776   0.81435396
  0.87265418  0.94216634  0.81898511  0.94313846  0.89493794  0.84868517
  0.89928191  0.84725524  0.80917119  0.89353057  0.77246996  0.84685884
  0.95043553  0.79931677  0.95003111  0.96277287  0.92492218  0.84651461
  0.84279751  0.90792114  0.93496357  0.90726534  0.91711933  0.85169024
  0.95480236  0.90822071  0.8002323   0.92817817  0.82011239  0.9214755
  0.85508289  0.8999148   0.92083473  0.70042137  0.88878925  0.8993862
  0.9597521   0.91331319  0.85208645 -0.23951208  0.93854687  0.88115669
  0.79844409  0.83731235]

 Average model Accuracy: 85.506% with std. dev - (16.781%)
In [583]:
modelComp
Out[583]:
Model Train Accuracy Test Accuracy Kfold-Mean-Accuracy Kfold-StdDeviation
0 Decision Tree 0.915758 0.822194 0.750305 0.190860
0 Random Forest Regressor 0.962390 0.895806 0.855060 0.167812
In [ ]:
 

Attempting Random Forest with PCA to see the impact

In [501]:
X=df_scaled.iloc[:,0:8]
y = df_scaled.iloc[:,8]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = seed)

from sklearn.decomposition import PCA
pca = PCA(8)# Initialize PCA object

pca.fit(X_train)
Out[501]:
PCA(copy=True, iterated_power='auto', n_components=8, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [502]:
pca.explained_variance_
Out[502]:
array([2.35485693, 1.45766583, 1.30602315, 1.02517164, 0.95561896,
       0.84788175, 0.17197646, 0.03033654])
In [503]:
pca = PCA(n_components=6)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)  # PCs for the train data
X_test_pca = pca.transform(X_test)    # PCs for the test data

X_train_pca.shape, X_test_pca.shape
Out[503]:
((721, 6), (309, 6))
In [504]:
rf = RandomForestRegressor(bootstrap = True,
 max_depth= 15,
 max_features= 'sqrt',
 min_samples_leaf= 1,
 n_estimators= 100)

rf.fit(X_train_pca, y_train)

y_pred = rf.predict(X_test_pca)
# Train data accuracy
print('Performance on training data using DT:',rf.score(X_train_pca,y_train))
# test data accuracy
print('Performance on testing data using DT:',rf.score(X_test_pca,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
model1 = rf
results = cross_val_score(model1, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Performance on training data using DT: 0.9682607687349181
Performance on testing data using DT: 0.8350649055522287
Accuracy Test:  0.8350649055522288
MSE:  0.1796259352134505
[ 0.89723608  0.91178801  0.87702512  0.6709964   0.91457287  0.85511894
  0.90261399  0.94336822  0.78652961  0.93784618  0.86703203  0.78868843
  0.92546765  0.83069572  0.77822597  0.89002382  0.78334646  0.82877511
  0.9511783   0.79194306  0.94333988  0.9517162   0.9286389   0.87588208
  0.7544756   0.90223198  0.95345461  0.94846117  0.88473306  0.88194395
  0.95399263  0.93061389  0.80530737  0.9432384   0.79845417  0.9305672
  0.86399002  0.94607269  0.89874391  0.80116961  0.87596714  0.83143593
  0.92277125  0.94753577  0.90988627 -0.10836533  0.9281527   0.93669344
  0.75053236  0.86898368]

 Average model Accuracy: 85.786% with std. dev - (15.310%)
  • With the PCA implementation we see a similar impact to our existing data set (with the selected attributes)
  • We will proceed with the existing data set rather than the PCA-transformed one, for better interpretability of the feature relations
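The `n_components` choice above (6 of 8) can be made explicit from the cumulative explained-variance ratio. A sketch on synthetic correlated data standing in for the scaled frame:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic correlated data: 8 observed columns driven by 3 latent factors
rng = np.random.default_rng(50)
latent = rng.normal(size=(500, 3))
X = StandardScaler().fit_transform(
    latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8)))

# keep the smallest k whose cumulative explained variance reaches, say, 95%
pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cum, 3))
print("components for 95% variance:", int(np.searchsorted(cum, 0.95) + 1))
```

On the concrete data the same cumulative curve would justify (or challenge) dropping the last two components.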
Gradient Boosting Regressor
In [594]:
X=df_scaled.iloc[:,0:8]
y = df_scaled.iloc[:,8]

#Removing less contributing features
X=X.drop(['ash','coarseagg','fineagg'],axis=1)

# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = seed)
In [506]:
model=GradientBoostingRegressor()
model.fit(X_train, y_train)
Out[506]:
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
In [507]:
y_pred = model.predict(X_test)
# Train data accuracy
print('Performance on training data using DT:',model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using DT:',model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
model1 = model
results = cross_val_score(model1, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Performance on training data using DT: 0.9372683553277921
Performance on testing data using DT: 0.8981133364516951
Accuracy Test:  0.8981133364516951
MSE:  0.11096175308789619
[ 0.92439463  0.89959681  0.82058342  0.75761653  0.93153637  0.77965549
  0.91215735  0.92219971  0.85399774  0.95808461  0.90324795  0.85924804
  0.8692956   0.81838358  0.75087595  0.90503659  0.76701924  0.86652429
  0.95260206  0.76521185  0.94438462  0.9681356   0.89338348  0.85577745
  0.87475291  0.86541472  0.96105006  0.89298157  0.93590639  0.83171095
  0.93158489  0.87925596  0.79735437  0.95325642  0.85758671  0.90353833
  0.86609529  0.92033917  0.8978231   0.67751799  0.9135978   0.90495104
  0.94924067  0.92814905  0.90240576 -0.46797785  0.94852494  0.8737974
  0.61301817  0.86683322]

 Average model Accuracy: 84.715% with std. dev - (20.147%)
  • With the gradient boosting regressor, the train and test accuracies are closer together than for the other models, even with default settings
  • We will try to bring the train and test numbers even closer, to make the model more generalized without losing much test accuracy.
  • The cross-val score is also better, at an accuracy of around 85%
We can take the following approach:
  • Choose a relatively high learning rate. The default value of 0.1 generally works, but somewhere between 0.05 and 0.2 should suit different problems
  • Determine the optimum number of trees for this learning rate. This should be in roughly the 5-10 range here, as we noticed earlier during the decision tree analysis as well. Remember to choose a value on which your system can train fairly fast, because it will be reused for testing various scenarios while determining the tree parameters.
  • Tune the tree-specific parameters for the chosen learning rate and number of trees. Note that we can choose different parameters to define a tree.
  • Lower the learning rate and increase the number of estimators proportionally to get more robust models.
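The last two points can be combined in a single fit: `staged_predict` yields the test prediction after every boosting round, so lowering the learning rate and reading off the best round replaces repeated refits. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the selected-feature data
X, y = make_regression(n_samples=600, n_features=5, noise=10.0, random_state=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=50)

# one fit at a low learning rate; staged_predict shows where extra
# estimators stop improving the held-out score
model = GradientBoostingRegressor(learning_rate=0.05, n_estimators=300,
                                  random_state=50).fit(X_train, y_train)
test_r2 = [r2_score(y_test, p) for p in model.staged_predict(X_test)]
print("best round:", int(np.argmax(test_r2) + 1), "r2:", round(max(test_r2), 3))
```

The best round found this way is the `n_estimators` to pair with that learning rate in the final model.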
In [ ]:
 
In [508]:
parameters = {
    "loss":['ls', 'lad', 'huber', 'quantile'],
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
    #"min_samples_split": np.linspace(0.1, 0.5, 5),
    #"min_samples_leaf": np.linspace(0.1, 0.5, 5),
    "max_depth":[3,5,8],
    "max_features":["log2","sqrt"],
    "criterion": ["friedman_mse",  "mae"],
    "subsample":[0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
    "n_estimators":[10]
    }
clf = GridSearchCV(estimator = GradientBoostingRegressor(), 
                   param_grid = parameters, 
                   cv = 5, 
                   verbose = 2, 
                   n_jobs= 4)
clf.fit(X_train, y_train)

clf.best_params_
Fitting 5 folds for each of 2352 candidates, totalling 11760 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 128 tasks      | elapsed:    1.1s
[Parallel(n_jobs=4)]: Done 728 tasks      | elapsed:   17.1s
[Parallel(n_jobs=4)]: Done 1672 tasks      | elapsed:   38.1s
[Parallel(n_jobs=4)]: Done 3204 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done 4806 tasks      | elapsed:  1.8min
[Parallel(n_jobs=4)]: Done 6132 tasks      | elapsed:  2.3min
[Parallel(n_jobs=4)]: Done 7666 tasks      | elapsed:  3.2min
[Parallel(n_jobs=4)]: Done 8848 tasks      | elapsed:  3.9min
[Parallel(n_jobs=4)]: Done 10146 tasks      | elapsed:  4.8min
[Parallel(n_jobs=4)]: Done 11760 out of 11760 | elapsed:  5.8min finished
Out[508]:
{'criterion': 'friedman_mse',
 'learning_rate': 0.2,
 'loss': 'ls',
 'max_depth': 8,
 'max_features': 'log2',
 'n_estimators': 10,
 'subsample': 0.85}
In [510]:
# Split the parameter grid into two searches to reduce the run time
parameters = {
    "loss":['ls'],
    "learning_rate": [0.2],
    "min_samples_split": np.linspace(0.1, 0.5, 5),
    "min_samples_leaf": np.linspace(0.1, 0.5, 5),
    "max_depth":[8],
    "max_features":["log2"],
    "criterion": ["friedman_mse"],
    "subsample":[0.9],
    "n_estimators":[10,100]
    }
clf = GridSearchCV(estimator = GradientBoostingRegressor(), 
                   param_grid = parameters, 
                   cv = 5, 
                   verbose = 2, 
                   n_jobs= 4)
clf.fit(X_train, y_train)

clf.best_params_
Fitting 5 folds for each of 50 candidates, totalling 250 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  34 tasks      | elapsed:    2.4s
[Parallel(n_jobs=4)]: Done 243 out of 250 | elapsed:    4.8s remaining:    0.0s
[Parallel(n_jobs=4)]: Done 250 out of 250 | elapsed:    4.9s finished
Out[510]:
{'criterion': 'friedman_mse',
 'learning_rate': 0.2,
 'loss': 'ls',
 'max_depth': 8,
 'max_features': 'log2',
 'min_samples_leaf': 0.1,
 'min_samples_split': 0.2,
 'n_estimators': 100,
 'subsample': 0.9}


In [584]:
model=GradientBoostingRegressor(criterion= 'friedman_mse', 
                                learning_rate= 0.2,
                                loss= 'ls',
                                max_depth= 8,
                                max_features= 'log2',
                                n_estimators= 100,
                                subsample= 0.9,
                                min_samples_leaf=0.1,
                                min_samples_split= 0.2,)
model.fit(X_train, y_train)
Out[584]:
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.2, loss='ls', max_depth=8,
                          max_features='log2', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=0.1, min_samples_split=0.2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=0.9, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
In [585]:
y_pred = model.predict(X_test)
# Train data accuracy
print('Performance on training data using GBR:',model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using GBR:',model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
#model1 = model
results = cross_val_score(model, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

modelComp=modelComp.append(pd.DataFrame({'Model':['GradientBoostingRegressor'],
                                         'Train Accuracy':[model.score(X_train,y_train)],
                                         'Test Accuracy':[model.score(X_test , y_test)],
                                         'Kfold-Mean-Accuracy':[results.mean()],
                                         'Kfold-StdDeviation':[results.std()]}))
Performance on training data using GBR: 0.9214569286990772
Performance on testing data using GBR: 0.89206494718136
Accuracy Test:  0.89206494718136
MSE:  0.11754887502732675
[ 0.8924366   0.89201497  0.81896291  0.64368623  0.93748412  0.84054178
  0.91718645  0.89982547  0.86783762  0.95967294  0.86286071  0.88147903
  0.91717167  0.89018931  0.75260612  0.96591238  0.79247193  0.87727727
  0.9542256   0.83687562  0.94560521  0.9572493   0.90582102  0.87228031
  0.91374075  0.82527244  0.91549934  0.8970678   0.94276321  0.85614006
  0.86929388  0.92217901  0.75985507  0.96142099  0.87851572  0.91784641
  0.8394202   0.94332152  0.93763565  0.72342947  0.90110565  0.85187494
  0.90423282  0.92301729  0.92365123 -0.03354631  0.92827095  0.9022093
  0.33040652  0.87324286]

 Average model Accuracy: 85.379% with std. dev - (16.145%)
  • Train and test set accuracies are very close, at about 92% and 89% respectively, and the cross-validation score is close as well at about 85%
  • This model is more generalized, as the train and test accuracies sit near each other
In [586]:
modelComp
Out[586]:
Model Train Accuracy Test Accuracy Kfold-Mean-Accuracy Kfold-StdDeviation
0 Decission Tree 0.915758 0.822194 0.750305 0.190860
0 Random Forest Regressor 0.962390 0.895806 0.855060 0.167812
0 GradientBoostingRegressor 0.921457 0.892065 0.853791 0.161445
In [ ]:
 
AdaBoostRegressor
In [597]:
model=AdaBoostRegressor()
model.fit(X_train, y_train)
Out[597]:
AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=None)
In [598]:
y_pred = model.predict(X_test)
# Train data accuracy
print('Performance on training data using AdaBoost:',model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using AdaBoost:',model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
model1 = model
results = cross_val_score(model1, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Performance on training data using AdaBoost: 0.8114798912114332
Performance on testing data using AdaBoost: 0.7914404243171765
Accuracy Test:  0.7914404243171766
MSE:  0.2271360680101382
[0.76518922 0.7942327  0.73070482 0.60372777 0.80254818 0.67242772
 0.77245899 0.82857562 0.66758728 0.7961885  0.79780496 0.67716136
 0.75401385 0.66376992 0.47425566 0.73652045 0.76353574 0.7049683
 0.82095303 0.66525442 0.85105983 0.89341637 0.7916936  0.79683046
 0.72247204 0.78515542 0.85966676 0.7101661  0.79758599 0.65781435
 0.79881453 0.75375682 0.65583253 0.8370168  0.79642189 0.76561136
 0.74264677 0.77301724 0.81921811 0.54523111 0.68608197 0.6616744
 0.8176945  0.8719268  0.73555149 0.11497572 0.85047865 0.73968547
 0.06583527 0.79372358]

 Average model Accuracy: 72.366% with std. dev - (15.257%)
In [ ]:
# Custom parameters
In [603]:
model=AdaBoostRegressor(n_estimators=100,
    learning_rate=1,
    loss='linear')
model.fit(X_train, y_train)
Out[603]:
AdaBoostRegressor(base_estimator=None, learning_rate=1, loss='linear',
                  n_estimators=100, random_state=None)
In [604]:
y_pred = model.predict(X_test)
# Train data accuracy
print('Performance on training data using AdaBoost:',model.score(X_train,y_train))
# test data accuracy
print('Performance on testing data using AdaBoost:',model.score(X_test,y_test))
#r2 score
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy Test: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
#kfold = KFold(n_splits=num_folds)
model1 = model
results = cross_val_score(model1, X_train, y_train, cv=kfold)
print(results)
print("\n Average model Accuracy: %.3f%% with std. dev - (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Performance on training data using AdaBoost: 0.8209571259263722
Performance on testing data using AdaBoost: 0.7940814068180938
Accuracy Test:  0.7940814068180938
MSE:  0.22425985204653193
[0.79287044 0.78691157 0.75734694 0.66616032 0.80552383 0.70580789
 0.77802377 0.82885075 0.6731208  0.83398967 0.85871897 0.53359487
 0.7771515  0.660484   0.52734474 0.80074122 0.75961956 0.72739641
 0.80926395 0.69573798 0.83735235 0.88806917 0.79223014 0.80186462
 0.69370042 0.78217658 0.82582297 0.73528328 0.7958509  0.7158486
 0.80729626 0.75912349 0.60907069 0.84217503 0.79483826 0.76468863
 0.79538958 0.8099638  0.81651796 0.53305206 0.66455002 0.68970224
 0.83410304 0.86148362 0.79839147 0.00138051 0.85312291 0.75380493
 0.00836331 0.81699498]

 Average model Accuracy: 72.922% with std. dev - (16.937%)
  • The AdaBoost regressor's train and test set accuracies are both around 80%, so the model is neither overfit nor underfit.
  • But its cross-validation score is low compared to the other regressors
  • We were able to squeeze out a little extra gain in both train and test accuracies, but it is still not comparable to the other regressors.
In [605]:
modelComp
Out[605]:
Model Train Accuracy Test Accuracy Kfold-Mean-Accuracy Kfold-StdDeviation
0 Decission Tree 0.915758 0.822194 0.750305 0.190860
0 Random Forest Regressor 0.962390 0.895806 0.855060 0.167812
0 GradientBoostingRegressor 0.921457 0.892065 0.853791 0.161445
  • Comparing these results, the train and test accuracies are closest for gradient boosting, followed by the random forest regressor
  • Keeping overfitting in mind, we are tilting towards GBM: it showed similar test accuracy, and its train accuracy was also close by.
  • So far we can consider gradient boosting, as it is the most generalized model and its train/test/cross-validation accuracies are the closest among the models compared
Bootstrap Sampling and Confidence Interval
In [395]:
from sklearn.utils import resample
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
data = X.join(y)
values = data.values
In [ ]:
 
In [401]:
#Random Forest
# Number of bootstrap samples to create
n_iterations = 1000        
# size of a bootstrap sample
n_size = int(len(data) * 0.5)    

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
stats = list()   
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)  # Sampling with replacement 
    test = np.array([x for x in values if x.tolist() not in train.tolist()])  # picking rest of the data not considered in sample
    
    
     # fit model
    model = RandomForestRegressor(
        bootstrap = True,
        max_depth= 8,
        max_features= 'sqrt',
        min_samples_leaf= 1,
        n_estimators= 100)
    # fit against independent variables and corresponding target values
    model.fit(train[:,:-1], train[:,-1]) 
    # Take the target column for all rows in test set

    y_test = test[:,-1]    
    # evaluate model
    # predict based on independent variables in the test data
    score = model.score(test[:, :-1] , y_test)
    predictions = model.predict(test[:, :-1])  

    stats.append(score)
In [402]:
# plot scores
pyplot.hist(stats)
pyplot.show()
# confidence intervals
alpha = 0.95                             # for 95% confidence 
p = ((1.0-alpha)/2.0) * 100              # tail regions on left and right: 2.5% each side for alpha = 0.95
lower = max(0.0, np.percentile(stats, p))  
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
95.0 confidence interval 81.6% and 86.6%
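The interval computation in this cell can be wrapped in a small reusable helper (the function name percentile_ci is ours, not from the notebook); here it is exercised on simulated scores rather than the real stats list:

```python
import numpy as np

def percentile_ci(stats, alpha=0.95):
    """Percentile bootstrap CI: the central `alpha` mass of the score
    distribution, clipped to the valid [0, 1] range for R^2-style scores."""
    tail = (1.0 - alpha) / 2.0 * 100           # 2.5% in each tail for alpha=0.95
    lower = max(0.0, np.percentile(stats, tail))
    upper = min(1.0, np.percentile(stats, 100 - tail))
    return lower, upper

# Example: 1000 simulated bootstrap scores centred near 0.84
rng = np.random.default_rng(0)
lo, hi = percentile_ci(rng.normal(0.84, 0.013, size=1000))
print('95%% CI: %.3f - %.3f' % (lo, hi))
```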
In [ ]:
 
In [398]:
#Gradient Boost
# Number of bootstrap samples to create
n_iterations = 1000        
# size of a bootstrap sample
n_size = int(len(data) * 0.5)    

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
stats = list()   
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)  # Sampling with replacement 
    test = np.array([x for x in values if x.tolist() not in train.tolist()])  # picking rest of the data not considered in sample
    
    
     # fit model
        
    model = GradientBoostingRegressor(criterion= 'mae', 
                                learning_rate= 0.2,
                                loss= 'huber',
                                max_depth= 5,
                                max_features= 'sqrt',
                                n_estimators= 100,
                                subsample= 0.9,
                                min_samples_leaf=0.1,
                                min_samples_split= 0.2,)
    # fit against independent variables and corresponding target values
    model.fit(train[:,:-1], train[:,-1]) 
    # Take the target column for all rows in test set

    y_test = test[:,-1]    
    # evaluate model
    # predict based on independent variables in the test data
    score = model.score(test[:, :-1] , y_test)
    predictions = model.predict(test[:, :-1])  

    stats.append(score)
In [400]:
# plot scores

pyplot.hist(stats)
pyplot.show()
# confidence intervals
alpha = 0.95                             # for 95% confidence 
p = ((1.0-alpha)/2.0) * 100              # tail regions on left and right: 2.5% each side for alpha = 0.95
lower = max(0.0, np.percentile(stats, p))  
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
95.0 confidence interval 82.6% and 87.5%
In [ ]:
 
  • For gradient boosting: 95.0% confidence interval of 82.6% to 87.5%
  • For random forest: 95.0% confidence interval of 81.6% to 86.6%
  • There is a slight advantage for the gradient boosting model over random forest
  • Comparing the model performance ranges at 95% confidence, it is a tough decision between the models on these numbers alone. We have to weigh in the model design and what the data and its complexity require.
  • RF basically has only one hyperparameter to set: the number of features to randomly select at each node. The rule of thumb of using the square root of the total number of features works pretty well in most cases, but it needs to be checked case by case. GBMs, on the other hand, have several hyperparameters, including the number of trees, the depth (or number of leaves), and the shrinkage (learning rate).
  • There is one fundamental performance difference that may force you to choose random forests over gradient boosted machines (GBMs): random forests can easily be trained in a distributed fashion because their trees are built independently in parallel, whereas a GBM builds its trees one after another.
  • In our scenario we have a limited data set, so the loss of parallelism does not affect us heavily. Moreover, our data has few linear relations, and the GBM approach of iteratively reweighting through multiple rounds may give us an added advantage over RF.
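The parallelism point is visible directly in scikit-learn's API: RandomForestRegressor exposes an n_jobs parameter, while GradientBoostingRegressor (for the boosting itself) does not. A minimal timing sketch on synthetic data (actual speed-up depends on the machine):

```python
from time import perf_counter
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=1)

# RF trees are independent, so n_jobs=-1 builds them on all cores;
# a GBM has no such switch because tree k needs the residuals of tree k-1.
timings = {}
for n_jobs in (1, -1):
    t0 = perf_counter()
    RandomForestRegressor(n_estimators=200, n_jobs=n_jobs,
                          random_state=1).fit(X, y)
    timings[n_jobs] = perf_counter() - t0
    print('n_jobs=%2d: %.2fs' % (n_jobs, timings[n_jobs]))
```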
Reasons for choosing GBM
  • RF overfits a sample of the training data and then reduces the overfit by simply averaging the predictors, whereas GBM repeatedly trains trees on the residuals of the previous predictors.
  • RF is easy to use (fewer tuning parameters): we can apply it almost blindly and get decent performance with little chance of overfitting, but GBM is of little use without cross-validation and needs much more care to set up. Because we can tune GBM's hyperparameters, such as the number of trees, depth, and learning rate, its prediction performance can be better than random forest's.
  • With continuous learning and exposure to more data, we can fine-tune GBM further while keeping it in a generalized form
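The first bullet above, each new GBM tree being fit to the residuals of the ensemble so far, can be hand-rolled for squared loss in a few lines (a sketch on synthetic data, not the notebook's tuned model):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)

# Hand-rolled boosting: each shallow tree fits the residuals left by
# the ensemble so far, then is added with a shrinkage factor.
lr, trees, pred = 0.1, [], np.zeros(len(y))
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, y - pred)                 # fit the current residuals
    pred += lr * tree.predict(X)          # shrink and add to the ensemble
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
print('training MSE after 100 rounds: %.2f' % mse)
```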
In [ ]: